Hive Architecture, Deployment, and Configuration

Posted by Jackson on 2017-09-02

Hive Overview

Hive official site: https://hive.apache.org/

The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

Open-sourced by Facebook, originally to solve statistical analysis over massive amounts of structured log data
A data warehouse built on top of Hadoop (HDFS/MapReduce/YARN)
Hive's data is stored on HDFS

  • Hive's underlying execution engines: MapReduce/Tez/Spark
  • Switching the execution engine takes only a single parameter (see the sketch after this list)
  • Hive jobs are submitted to YARN to run
  • Provides the HQL query language, which is similar to SQL but not identical
  • Suited to offline/batch processing
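For example, a minimal CLI sketch of the engine switch (hive.execution.engine accepts mr, tez, or spark; the chosen engine must of course be installed on the cluster):

hive> set hive.execution.engine;
hive.execution.engine=mr
hive> set hive.execution.engine=spark;
hive> set hive.execution.engine=mr;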

Drawbacks of processing big data with Hadoop directly:

  • Writing MapReduce is cumbersome (you must code the Map, Reduce, and Main classes, then package and submit a jar)
  • Unfriendly to database/SQL people

Background of Hive:
it is hard to run business statistics and analysis directly over files on HDFS, and writing raw MapReduce costs a great deal of time

Pros and cons of Hive:

Pros: statistics, analysis, and queries can be expressed as SQL statements

Cons: if the underlying engine is MapReduce, Hive's performance is necessarily poor

Hive Architecture

Hive architecture diagram

Reading the architecture diagram

  • User interface layer: command line (CLI), JDBC
  • Driver:
    SQL parsing: SQL ===> AST; the SQL is parsed into an abstract syntax tree using antlr
    Query optimization: generates the logical and physical execution plans
    UDF/SerDe: user-defined functions; SerDes implement serialization and deserialization
  • Execution: the final execution plan is run on MapReduce (or the configured engine)
  • MetaStore: metadata such as the table name, column names, table type, the directory where the table data lives, etc.

Hive's data lives on HDFS and splits into two parts: the data itself plus the metadata (the metadata is usually stored in MySQL).

How does Hive differ from a traditional relational database?

Similarities:
Both support SQL syntax
Both support transactions (Hive only in a limited form; see the note below)

Differences:
Hive handles far larger data volumes
Hive is batch-oriented and therefore high-latency, while a traditional relational database is expected to answer at low latency

Note: Insert, Update, and Delete are only supported in Hive 0.14 and later. In practice we mostly batch-load data into Hive and then run statistical analysis over it, as sketched below.
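A minimal sketch of that batch workflow (the local path and the access_log table are made up for illustration):

-- bulk-load a local file into a Hive table in one shot (appends; add OVERWRITE to replace)
LOAD DATA LOCAL INPATH '/home/hadoop/data/access.log' INTO TABLE access_log;
-- then run the analysis in plain SQL
SELECT count(*) FROM access_log;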

Deploying Hive

Official reference: https://cwiki.apache.org/confluence/display/Hive/GettingStarted

1. Prerequisites

  • JDK8
  • Hadoop2.x
  • Linux
  • MySQL

2. Download the tarball
wget http://archive.cloudera.com/cdh5/cdh/5/hive-1.1.0-cdh5.16.2.tar.gz

3. Extract: tar -zxvf hive-1.1.0-cdh5.16.2.tar.gz -C ~/app/

4. Fix the owner and group: chown -R hadoop:hadoop /home/hadoop/app/hive-1.1.0-cdh5.16.2/*

5. Create a symlink: ln -s hive-1.1.0-cdh5.16.2/ hive

6. Configure environment variables

[hadoop@bigdata01 app]$ cd ~
[hadoop@bigdata01 ~]$ vim .bashrc
export HIVE_HOME=/home/hadoop/app/hive
export PATH=$HIVE_HOME/bin:$PATH

[hadoop@bigdata01 ~]$ source .bashrc
[hadoop@bigdata01 ~]$ which hive
~/app/hive/bin/hive

7. Copy the MySQL JDBC driver jar into $HIVE_HOME/lib/
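For example (the source path and connector version are placeholders; any MySQL Connector/J 5.1.x jar matches the com.mysql.jdbc.Driver class configured below):

cp ~/software/mysql-connector-java-5.1.47.jar $HIVE_HOME/lib/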

8. Hive configuration file

The distribution ships no template for hive-site.xml, so you need to create it yourself:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://bigdata01:3306/bigdata_hive?createDatabaseIfNotExist=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>

  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
  </property>
</configuration>

Viewing Hive Logs

Entering the Hive CLI and running show databases; fails with:

FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: 
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

Go into /tmp/hadoop and check hive.log for the full stack trace:

Unable to open a test connection to the given database. JDBC url = jdbc:mysql://192.168.52.50:3306/bigdata_hive?createDatabaseIfNotExist=true, username = root. Terminating connection pool (set lazyInit to true
if you expect to start your database after your app). Original Exception: ------
java.sql.SQLException: Access denied for user 'root'@'bigdata01' (using password: YES)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1078)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4237)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4169)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:928)
at com.mysql.jdbc.MysqlIO.proceedHandshakeWithPluggableAuthentication(MysqlIO.java:1750)
at com.mysql.jdbc.MysqlIO.doHandshake(MysqlIO.java:1290)
at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2493)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2526)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2311)
at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:834)
at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:47)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:416)
at com.mysql.jdbc.NonRegisteringDriver.connect(NonRegisteringDriver.java:347)
at java.sql.DriverManager.getConnection(DriverManager.java:664)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at com.jolbox.bonecp.BoneCP.obtainRawInternalConnection(BoneCP.java:361)
at com.jolbox.bonecp.BoneCP.<init>(BoneCP.java:416)
at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:120)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:501)
at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:420)
at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:449)
at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:344)
at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:300)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:60)
at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:69)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:685)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:663)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:712)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:511)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:78)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:84)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6517)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:207)
at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1660)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:68)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:83)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3412)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3431)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllFunctions(Hive.java:3656)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:232)
at org.apache.hadoop.hive.ql.metadata.Hive.registerAllFunctionsOnce(Hive.java:216)
at org.apache.hadoop.hive.ql.metadata.Hive.<init>(Hive.java:339)
at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:300)
at org.apache.hadoop.hive.ql.metadata.Hive.get(Hive.java:275)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.createHiveDB(BaseSemanticAnalyzer.java:201)
at org.apache.hadoop.hive.ql.parse.DDLSemanticAnalyzer.<init>(DDLSemanticAnalyzer.java:222)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzerFactory.get(SemanticAnalyzerFactory.java:265)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:546)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1358)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1475)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1287)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1277)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:226)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:175)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:389)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:699)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:634)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
at org.apache.hadoop.util.RunJar.main(RunJar.java:141)

Error analysis: Hive cannot instantiate its metastore client. The likely cause is a failed MySQL connection, so check that the MySQL service is running, that the MySQL connection details configured in hive-site.xml are correct, and that the MySQL driver jar is present. Here, judging from the Access denied message, the problem is a wrong username/password combination.

Cause: the MySQL username and password configured in hive-site.xml were wrong.
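If the credentials look right, another thing worth checking is whether the user may connect from the Hive host at all; a sketch for MySQL 5.x, reusing the values from hive-site.xml above:

mysql> GRANT ALL PRIVILEGES ON bigdata_hive.* TO 'root'@'%' IDENTIFIED BY '123456';
mysql> FLUSH PRIVILEGES;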

Commands inside the Hive CLI:

  • !clear; clears the screen
  • exit; quits the CLI
  • use dbname; switches to the database dbname
  • show tables; lists all tables in the current database

Create a table: create table stu(id int, name string, age int);

Inspect the table structure:

desc stu;
desc extended stu;
desc formatted stu;
show create table stu;

Insert data: insert into stu values(1,'tom',30);

Query data: select * from stu;

When the stu table is created, it is stored by default under this HDFS directory:
hdfs://bigdata01:8020/user/hive/warehouse/stu
hive.metastore.warehouse.dir defaults to /user/hive/warehouse
stu is the table name
so a table's full path is ${hive.metastore.warehouse.dir}/tablename, which can be checked as shown below
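This is easy to confirm from the HDFS side (assuming the default warehouse directory):

# the table's directory in the warehouse
hdfs dfs -ls /user/hive/warehouse/stu
# the data files written by insert live directly underneath it
hdfs dfs -text /user/hive/warehouse/stu/*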

Hive's full execution log is configured as follows:
cd $HIVE_HOME/conf
cp hive-log4j.properties.template hive-log4j.properties

# relevant defaults inside the file:
hive.log.dir=${java.io.tmpdir}/${user.name}
hive.log.file=hive.log

# so the log ends up at ${java.io.tmpdir}/${user.name}/${hive.log.file}, i.e.:
/tmp/hadoop/hive.log

cat hive-log4j.properties

[hadoop@bigdata01 conf]$ cat hive-log4j.properties
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Define some default values that can be overridden by system properties
hive.log.threshold=ALL
hive.root.logger=WARN,DRFA
hive.log.dir=${java.io.tmpdir}/${user.name}
hive.log.file=hive.log

# Define the root logger to the system property "hadoop.root.logger".
log4j.rootLogger=${hive.root.logger}, EventCounter

# Logging Threshold
log4j.threshold=${hive.log.threshold}

#
# Daily Rolling File Appender
#
# Use the PidDailyerRollingFileAppend class instead if you want to use separate log files
# for different CLI session.
#
# log4j.appender.DRFA=org.apache.hadoop.hive.ql.log.PidDailyRollingFileAppender

log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender

log4j.appender.DRFA.File=${hive.log.dir}/${hive.log.file}

# Rollver at midnight
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd

# 30-day backup
#log4j.appender.DRFA.MaxBackupIndex=30
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout

# Pattern format: Date LogLevel LoggerName LogMessage
#log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
# Debugging Pattern format
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n


#
# console
# Add "console" to rootlogger above if you want to use this
#

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n
log4j.appender.console.encoding=UTF-8

#custom logging levels
#log4j.logger.xxx=DEBUG

#
# Event Counter Appender
# Sends counts of logging messages at different severity levels to Hadoop Metrics.
#
log4j.appender.EventCounter=org.apache.hadoop.hive.shims.HiveEventCounter


log4j.category.DataNucleus=ERROR,DRFA
log4j.category.Datastore=ERROR,DRFA
log4j.category.Datastore.Schema=ERROR,DRFA
log4j.category.JPOX.Datastore=ERROR,DRFA
log4j.category.JPOX.Plugin=ERROR,DRFA
log4j.category.JPOX.MetaData=ERROR,DRFA
log4j.category.JPOX.Query=ERROR,DRFA
log4j.category.JPOX.General=ERROR,DRFA
log4j.category.JPOX.Enhancer=ERROR,DRFA


# Silence useless ZK logs
log4j.logger.org.apache.zookeeper.server.NIOServerCnxn=WARN,DRFA
log4j.logger.org.apache.zookeeper.ClientCnxnSocketNIO=WARN,DRFA

#custom logging levels
log4j.logger.org.apache.hadoop.hive.ql.parse.SemanticAnalyzer=INFO
log4j.logger.org.apache.hadoop.hive.ql.Driver=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.mr.ExecDriver=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.mr.MapRedTask=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.mr.MapredLocalTask=INFO
log4j.logger.org.apache.hadoop.hive.ql.exec.Task=INFO
log4j.logger.org.apache.hadoop.hive.ql.session.SessionState=INFO
[hadoop@bigdata01 conf]$

hive.log.dir=${java.io.tmpdir}/${user.name} is Hive's log directory
${java.io.tmpdir} resolves to the /tmp directory
hive.log.file=hive.log is the name of the Hive log file
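So to watch what Hive is doing while a statement runs (the user here is hadoop, hence /tmp/hadoop):

tail -f /tmp/hadoop/hive.log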

Hive Configuration Properties

Global: set in $HIVE_HOME/conf/hive-site.xml
Temporary/per-session: takes effect only for the current session

  • View a property's current value: set key;
  • Change a property for the current session: set key=value;
  • Pass a property when launching the CLI: hive -hiveconf hive.cli.print.current.db=true (see the sketch below)
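A short sketch of all three, using hive.cli.print.current.db from the config above:

hive> set hive.cli.print.current.db;
hive.cli.print.current.db=true
hive> set hive.cli.print.current.db=false;

[hadoop@bigdata01 ~]$ hive -hiveconf hive.cli.print.current.db=true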

Hive's non-interactive command-line options:
-e: run a SQL statement directly, without entering the Hive CLI first
-f: execute a given file whose contents are SQL statements (examples below)
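For example (query.sql is a made-up file name):

# -e: run an inline statement and exit
hive -e "select * from stu;"

# -f: run every statement in a SQL file
echo "select count(*) from stu;" > query.sql
hive -f query.sql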